
Conversation

Kump3r commented Sep 25, 2025

Rendered
Previously discussed as well, but I think it makes sense to collect comments on this again, due to industry standards and an overall need for this. Should we open a discussion to collect interest/opinions of the community as well?

"status": "healthy/unhealthy",
"details": {
"database": "healthy/unhealthy",
"workers": "healthy/unhealthy",


"workers" as a single entry does not convey much meaningful information IMO. A list of the status of each worker might be more useful. Also, the semantics of the general "status" should be clarified. What is considered a healthy instance ?

Kump3r (Author) replied:

Yeah, I hadn't thought about it that way, thanks for the input. Extending it a bit further with our 1on1 discussion, we might not even need information about the workers or database, but rather whether the API is working and whether workloads are schedulable. So not looking at specific interfaces or services, but more or less: is the ATC working, and if so, can it schedule workloads? An example that comes to mind is a systematic/periodic one-off build which is tracked by this backend and reports a simple "run-jobs: healthy". Should the API not be reachable, the endpoint will be down anyway. So in that case it would look like:

"status": "healthy",
"run-jobs": "healthy"

Should it fail to run jobs within a certain time frame, the status will change to unhealthy.
Does that more or less sum it up, or am I missing something?
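
For completeness, the failing case would then be the same shape with the values flipped (a sketch):

{
    "status": "unhealthy",
    "run-jobs": "unhealthy"
}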

Kump3r (Author) commented Sep 26, 2025

I had an offline discussion with a stakeholder who also raised a valid question that can be added to the document:

Q: "How is Concourse on K8s determining the state of the pods?"
A: There are liveness and readiness probes defined in the chart, which make an HTTP request to the /api/v1/info endpoint.
The idea of the change would be to have a more dedicated endpoint that builds on that static check by also taking into account status that can change dynamically.

Thanks for the question, I will also add this to the document once a few more questions are gathered!
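
For reference, those probes are roughly of the following shape (a sketch in Kubernetes' JSON manifest form; the port and thresholds here are assumptions, not the chart's actual values). The proposed endpoint would essentially be an alternative path that also reflects dynamic status:

{
    "livenessProbe": {
        "httpGet": { "path": "/api/v1/info", "port": 8080 },
        "periodSeconds": 15,
        "failureThreshold": 5
    },
    "readinessProbe": {
        "httpGet": { "path": "/api/v1/info", "port": 8080 }
    }
}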

Kump3r (Author) commented Sep 26, 2025

Also part of an offline follow-up with another user:

"You can think of Google status health" to have a more red/green status pointing towards potential problems with the application.

I think a GUI change is a bit out of scope for this RFC, although this RFC would let the UI be extended that way easily, so it is worth writing it down as a possible future follow-up.

taylorsilva (Member) commented Sep 26, 2025

Totally for this. A lot of my questions are around implementation, which I see already written down in the POC PR concourse/concourse#4818. I think it would be nice for this RFC to define specifically what we want the Health JSON response to look like.

A Concourse web node is made up of a bunch of micro-service-ish components. We could potentially display the health of all of these components (see components.go). There may be some exceptions in that file, but most of these components are run "globally" across one of the web nodes, based on the workload the web node is handling. They're load-balanced!

There are some services on the web node that are not load-balanced, like the TSA and API. Those are always running on all web nodes.

A detailed health response could look something like this, which I think would accurately describe the entire Concourse cluster:

{
    "status": "...",
    "workers": {
        "worker-1": {
            "baggageclaim": "...",
            "garden": "..."
        }
        ...
    },
    "web-nodes": {
        "web-1": {
            "api": "...",
            "tsa": "...",
            "db-connection": "...",
            ...
        }
        ...
    },
    "global-components": {
        "log-collector": "...",
        "lidar": "...",
        "secret-management": "...",
        "scheduler": "..."
        ...
    }
}

I wouldn't expect an initial PR to fully implement all of that though. I think this RFC could clearly define what we want the end goal to look like and then slowly work towards it through multiple PRs. WDYT?
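
As one possible reading of that incremental approach (my own sketch, not something taylorsilva spelled out), a first iteration could report only the pieces that are already easy to check and leave the rest out:

{
    "status": "healthy",
    "web-nodes": {
        "web-1": {
            "api": "healthy",
            "db-connection": "healthy"
        }
    }
}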

Kump3r (Author) commented Sep 27, 2025

I agree, I really like the idea and I am all for having an easy-to-reach status board for all of the components. One of the key questions that comes to mind is when the overall status should change to unhealthy: although each component has its share of work to do, if some of them flap or are unstable, that shouldn't mean the instance is not operational, but rather somewhat degraded. So, building on what you wrote, it would be great before closing the RFC to have the JSON response and the conditions that are a hard requirement for a healthy instance figured out. Thanks to all for the feedback, I like the overall direction of the discussions here. Once we have a couple more comments, I will add all the discussions to the document.
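
To make that distinction concrete, a hypothetical degraded response (the "degraded" value and the component names are assumptions, nothing settled in this thread) might look like:

{
    "status": "degraded",
    "global-components": {
        "scheduler": "healthy",
        "lidar": "unhealthy"
    },
    "web-nodes": {
        "web-1": {
            "api": "healthy",
            "db-connection": "healthy"
        }
    }
}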

DimitarKapashikov commented:

If you plan to reuse the same endpoint for Kubernetes health checks, you can introduce a parameter to differentiate between web and worker nodes. For example:

  • /health?component=web
  • /health?component=workers

It could also be extended to the pod level, such as:

  • /health?component=workers-1
  • /health?component=workers-n

This way, Kubernetes can identify and restart individual pods if they become unhealthy.
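
A scoped response for such a query (hypothetical shape, reusing the structure sketched earlier in the thread) could then return only the requested subtree, e.g. for /health?component=web:

{
    "status": "healthy",
    "web-nodes": {
        "web-1": {
            "api": "healthy",
            "tsa": "healthy",
            "db-connection": "healthy"
        }
    }
}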

linux-foundation-easycla bot commented Nov 4, 2025

CLA Signed. The committers listed above are authorized under a signed CLA.
